Skip to content

Fix group name lookup crash for flattened mesh collectives in debug_helpers#364

Merged
fmassa merged 1 commit intomainfrom
fmassa/fix_group_name_lookup_cost
Mar 16, 2026
Merged

Fix group name lookup crash for flattened mesh collectives in debug_helpers#364
fmassa merged 1 commit intomainfrom
fmassa/fix_group_name_lookup_cost

Conversation

@fmassa
Copy link
Contributor

@fmassa fmassa commented Mar 13, 2026

make_custom_runtime_estimation assumed every collective's process group corresponds to a single mesh dimension, using groups_name.index(group_name) to resolve it. This crashes with ValueError when a collective uses the flattened mesh group created by mesh._flatten() in _apply_placement_common, since that group isn't in mesh.get_all_groups().

The fix resolves the group name against per-dimension groups first, then falls back to mesh._flatten() (which is cached/idempotent) with mesh_dim=0. All downstream shape and topology references use cost_mesh so the cost estimation is correct for both per-dimension and flattened collectives.

Authored with Claude.

…elpers

make_custom_runtime_estimation assumed every collective's process group
corresponds to a single mesh dimension, using groups_name.index(group_name)
to resolve it. This crashes with ValueError when a collective uses the
flattened mesh group created by mesh._flatten() in _apply_placement_common,
since that group isn't in mesh.get_all_groups().

The fix resolves the group name against per-dimension groups first, then falls
back to mesh._flatten() (which is cached/idempotent) with mesh_dim=0. All
downstream shape and topology references use cost_mesh so the cost estimation
is correct for both per-dimension and flattened collectives.

Authored with Claude.
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 13, 2026
@fmassa
Copy link
Contributor Author

fmassa commented Mar 16, 2026

Test failures are the same as in #366 and thus are unrelated to this PR, merging

@fmassa fmassa merged commit 3300fa4 into main Mar 16, 2026
10 of 12 checks passed
@fmassa fmassa deleted the fmassa/fix_group_name_lookup_cost branch March 16, 2026 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant